Voice processing method, apparatus, device and storage medium

ABSTRACT

The present application provides a voice processing method, an apparatus, a device, and a storage medium, including: acquiring a first acoustic feature of each of N voice frames, where N is a positive integer greater than 1; applying a neural network algorithm to N first acoustic features to obtain a first mask; modifying the first mask according to VAD information of the N voice frames to obtain a second mask; and processing the N first acoustic features according to the second mask to obtain a second acoustic feature, resulting in more effective noise suppression and less damage to the voice.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to Chinese Patent Application No. 201810595783.0, filed on Jun. 11, 2018, which is hereby incorporated by reference in its entirety.

TECHNICAL FIELD

The present application relates to the field of voice processing technologies and, in particular, to a voice processing method, an apparatus, a device, and a storage medium.

BACKGROUND

Even at a low signal-to-noise ratio, the human auditory system can identify the sound of interest in a noisy environment. This phenomenon is called the “cocktail party effect”, and it is often technically described as a blind source separation problem, that is, separating a foreground sound of interest from a noisy background sound in the absence of a reference signal.

The main technical means for blind source separation is to estimate a mask and process an acoustic feature via the mask. A neural network algorithm is currently used to estimate the mask. For example, for the amplitude spectrum feature of the Fast Fourier Transform (FFT) of voice, the mask is estimated as

${{mask}\mspace{11mu} \left( {t,f} \right)} = \sqrt{\frac{\sigma_{x}^{2}\left( {t,f} \right)}{\sigma_{y}^{2}\left( {t,f} \right)}}$

where t represents the t-th voice frame, f represents the f-th frequency point, $\sigma_{x}^{2}(t, f)$ represents the power of a clean voice at the time-frequency point (t, f), and $\sigma_{y}^{2}(t, f)$ represents the power of a noisy voice at the time-frequency point (t, f). However, in practice, the clean voice still carries noise. As a result, the estimated mask is not accurate enough, which in turn results in a poor voice processing effect.

SUMMARY

In order to solve the above technical problem, the present application provides a voice processing method, an apparatus, a device, and a storage medium, where a mask is modified according to voice activity detection (VAD) information, thereby eliminating a large number of discrete masks and resulting in more effective noise suppression and less damage to the voice.

In a first aspect, the present application provides a voice processing method, including: acquiring a first acoustic feature of each of N voice frames, where N is a positive integer greater than 1; applying a neural network algorithm to N first acoustic features to obtain a first mask; modifying the first mask according to voice activity detection (VAD) information of the N voice frames to obtain a second mask; and processing the N first acoustic features according to the second mask to obtain a second acoustic feature.

The present application is beneficial in that a mask is modified according to VAD information, which eliminates a large number of discrete masks, and an acoustic feature is processed with the modified mask, which may improve the effectiveness of noise suppression and reduce the damage to the voice.

Optionally, the modifying the first mask according to the VAD information of the N voice frames includes: calculating a product of the VAD information and the first mask to obtain the second mask. The first mask may be effectively modified with this method.

Optionally, the VAD information includes a VAD value corresponding to each of the voice frames. When the N voice frames include a silence frame, a VAD value corresponding to the silence frame is set to zero. With this method, the VAD information used to modify the first mask may be determined.

Optionally, the VAD information includes a VAD value corresponding to each of the voice frames. Correspondingly, before the modifying the first mask according to the VAD information of the N voice frames, the method further includes: determining M1 voice frames having a VAD value of 1 and P1 voice frames having a VAD value of 0 from the N voice frames, where the M1 voice frames are adjacent to the P1 voice frames, and where M1 and P1 are positive integers greater than 1; and smoothing the VAD value corresponding to M2 voice frames of the M1 voice frames and the VAD value corresponding to P2 voice frames of the P1 voice frames, such that the VAD value corresponding to the M2 voice frames and the VAD value corresponding to the P2 voice frames are changed gradually from 0 to 1 or from 1 to 0, where the M2 voice frames are adjacent to the P2 voice frames, and where 1≤M2≤M1, 1≤P2≤P1. With this method, the VAD information used to modify the first mask may be determined.

Optionally, the determining the M1 voice frames having the VAD value of 1 and the P1 voice frames having the VAD value of 0 from the N voice frames includes: determining a call type corresponding to each of the N voice frames, where the type includes silence and phone; determining a voice frame having the silence type as a voice frame having the VAD value of 0; and determining a voice frame having the phone type as a voice frame having the VAD value of 1.

Optionally, M2 and P2 are determined by a Hamming window, a triangular window or a Hanning window.

A voice processing apparatus, a device, a storage medium and a computer program product will be provided below; for their effects, reference may be made to the effects described above for the method, and details will not be repeated herein.

In a second aspect, the present application provides a voice processing apparatus, including:

an acquiring module, configured to acquire a first acoustic feature of each of N voice frames, where N is a positive integer greater than 1;

a training module, configured to apply a neural network algorithm to N first acoustic features to obtain a first mask;

a modification module, configured to modify the first mask according to voice activity detection (VAD) information of the N voice frames to obtain a second mask; and

a first processing module, configured to process the N first acoustic features according to the second mask to obtain a second acoustic feature.

Optionally, the modification module is specifically configured to calculate a product of the VAD information and the first mask to obtain the second mask.

Optionally, the VAD information includes a VAD value corresponding to each of the voice frames. Correspondingly, the apparatus further includes: a setting module, configured to, when the N voice frames include a silence frame, set a VAD value corresponding to the silence frame to zero.

Optionally, the VAD information includes a VAD value corresponding to each of the voice frames. Correspondingly, the apparatus further includes:

a determining module, configured to determine M1 voice frames having a VAD value of 1 and P1 voice frames having a VAD value of 0 from the N voice frames, where the M1 voice frames are adjacent to the P1 voice frames, and where M1 and P1 are positive integers greater than 1; and

a second processing module, configured to smooth the VAD value corresponding to M2 voice frames of the M1 voice frames and the VAD value corresponding to P2 voice frames of the P1 voice frames, such that the VAD value corresponding to the M2 voice frames and the VAD value corresponding to the P2 voice frames are changed gradually from 0 to 1 or from 1 to 0, where the M2 voice frames are adjacent to the P2 voice frames, and where 1≤M2≤M1, 1≤P2≤P1.

Optionally, the determining module is specifically configured to: determine a call type corresponding to each of the N voice frames, where the type includes silence and phone; determine a voice frame having the silence type as a voice frame having the VAD value of 0; and determine a voice frame having the phone type as a voice frame having the VAD value of 1.

Optionally, M2 and P2 are determined by a Hamming window, a triangular window or a Hanning window.

In a third aspect, the present application provides a voice processing device, including: a memory and a processor.

The memory is configured to store instructions executed by the processor, such that the processor performs the voice processing method according to the first aspect or any optional manner of the first aspect.

In a fourth aspect, the present application provides a storage medium, including computer executable instructions for implementing the voice processing method according to the first aspect or any optional manner of the first aspect.

In a fifth aspect, the present application provides a computer program product, including: computer executable instructions for implementing the voice processing method according to the first aspect or any optional manner of the first aspect.

The present application provides a voice processing method, an apparatus, a device, and a storage medium, including: acquiring a first acoustic feature of each of N voice frames, where N is a positive integer greater than 1; applying a neural network algorithm to N first acoustic features to obtain a first mask; modifying the first mask according to VAD information of the N voice frames to obtain a second mask; and processing the N first acoustic features according to the second mask to obtain a second acoustic feature. A mask is modified according to VAD information, which eliminates a large number of discrete masks; and an acoustic feature is processed with the modified mask, which may improve the effectiveness of noise suppression and reduce the damage to the voice.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is a flowchart of a voice processing method according to an embodiment of the present application;

FIG. 2 is a flowchart of a voice processing method according to another embodiment of the present application;

FIG. 3 is a schematic diagram of smoothing a VAD value according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a voice processing apparatus 400 according to an embodiment of the present application; and

FIG. 5 is a schematic diagram of a voice processing device 500 according to an embodiment of the present application.

DESCRIPTION OF EMBODIMENTS

As noted above, the cocktail party effect is often technically described as a blind source separation problem, that is, to separate a foreground sound of interest from a noisy background sound in the absence of a reference signal.

Blind source separation can be applied to the following scenarios:

Scenario 1: Extract the voice of a target speaker from the voices of multiple speakers. For example, a user is attempting to interact with a smart audio device on the coffee table while the news is broadcast on the TV in the living room. The audio device would receive both the user's voice request and the news broadcast from the host. That is to say, two people are speaking at the same time.

Scenario 2: Separate the voice from background noise. For example, when the driver is driving, the microphone of the car or mobile phone would receive various noises, such as wind noise, road noise, whistles, etc. Blind source separation can suppress these environmental noises and extract only the driver's voice.

Blind source separation is actually a regression model. If the performance of the model is not ideal, the following flaws will appear:

1. Background sounds are not eliminated. That is to say, the blind source separation cancels noise poorly, and the performance of noise suppression is low.

2. The target voice is also cancelled. That is to say, not only the noises but also the target voice is suppressed by the blind source separation.

3. The noises are not completely cancelled and the target voice is damaged. This situation is the most common: noises still exist at some time-frequency points, while the target voice is cancelled at some other time-frequency points.

Accordingly, the two key requirements for blind source separation are noise suppression and no damage to the target voice. A good blind source separation should be able to suppress a background noise to the greatest extent with minimal damage to the target voice.

The key to blind source separation is mask calculation. In the prior art, for an acoustic feature of each voice frame, a neural network is used to predict an output vector between 0 and 1, where the output vector is a mask.

The acoustic feature described above may be an amplitude spectrum of the FFT, Mel-frequency Cepstrum Coefficients (MFCC), a Mel-scale Filter Bank (FBank), Perceptual Linear Predictive (PLP), or the like.

For example, for the feature of the amplitude spectrum of the FFT of voice, the mask is estimated by

${{{mask}\mspace{11mu} \left( {t,f} \right)} = \sqrt{\frac{\sigma_{x}^{2}\left( {t,f} \right)}{\sigma_{y}^{2}\left( {t,f} \right)}}},$

where t represents the t-th voice frame, f represents the f-th frequency point, $\sigma_{x}^{2}(t, f)$ represents the power of a clean voice at the time-frequency point (t, f), and $\sigma_{y}^{2}(t, f)$ represents the power of a noisy voice at the time-frequency point (t, f). However, in practice, the clean voice still carries noise. As a result, the estimated mask is not accurate enough, which in turn results in a poor voice processing effect.
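For illustration only, the following is a minimal numpy sketch of this prior-art estimate; the function name, array shapes, and the clipping step are assumptions for the example, not part of the application.

    import numpy as np

    def estimate_mask(clean_power, noisy_power, eps=1e-10):
        # Prior-art estimate: mask(t, f) = sqrt(sigma_x^2(t, f) / sigma_y^2(t, f)).
        # clean_power and noisy_power are (T, F) arrays holding the power of the
        # clean voice and the noisy voice at each time-frequency point (t, f).
        mask = np.sqrt(clean_power / (noisy_power + eps))  # eps avoids division by zero
        return np.clip(mask, 0.0, 1.0)  # keep each component within [0, 1]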

In order to solve the technical problem, the present application provides a voice processing method, an apparatus, a device, and a storage medium. The technical solution of the present application is applied to, but not limited to, the application scenarios of the blind source separation described above. Specifically, FIG. 1 is a flowchart of a voice processing method according to an embodiment of the present application. The executive agent for the method may be a part of or the entire smart terminal, such as a computer, a mobile phone, or a notebook computer. The method will be described hereunder by taking the computer as the executive agent for the method. As shown in FIG. 1, the voice processing method includes the following steps:

Step S101: acquiring a first acoustic feature of each of N voice frames, where N is a positive integer greater than 1;

Step S102: applying a neural network algorithm to N first acoustic features to obtain a first mask;

Step S103: modifying the first mask according to VAD information of the N voice frames to obtain a second mask; and

Step S104: processing the N first acoustic features according to the second mask to obtain a second acoustic feature.

The description of Step S101 is as follows:

The first acoustic feature may be any one of an amplitude spectrum of the FFT, MFCC, FBank or PLP, which is not limited in the present application. Actually, the first acoustic features of the N voice frames constitute a first acoustic feature vector, where the vector includes N elements, each of which is the first acoustic feature corresponding to one of the N voice frames.
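As a hedged illustration of Step S101, the sketch below computes the FFT amplitude spectrum of each frame with numpy; the frame length, hop size and analysis window are illustrative choices, not values fixed by the application.

    import numpy as np

    def first_acoustic_features(signal, frame_len=512, hop=256):
        # Step S101 (sketch): split a waveform into N voice frames and take the
        # FFT amplitude spectrum of each frame as its first acoustic feature.
        n_frames = 1 + (len(signal) - frame_len) // hop
        window = np.hanning(frame_len)
        features = np.empty((n_frames, frame_len // 2 + 1))
        for t in range(n_frames):
            frame = signal[t * hop : t * hop + frame_len] * window
            features[t] = np.abs(np.fft.rfft(frame))  # amplitude spectrum of frame t
        return features  # shape (N, F): one first acoustic feature per voice frame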

The description of Step S102 is as follows:

It should be noted that the neural network algorithm involved in the present application is a neural network algorithm used in the prior art for mask calculation, which is not limited in this application.

Further, as described above, the neural network algorithm is applied to the N first acoustic features to obtain a first mask, where the first mask is a vector including N components, the N components respectively correspond to the N first acoustic features, and each of the N components has a value range of [0, 1].
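Since the application does not fix a network architecture, the sketch below stands in with a single dense layer followed by a sigmoid, which is enough to show the contract of Step S102: N first acoustic features in, a first mask of N components in [0, 1] out. The weights and bias are hypothetical placeholders for a trained network.

    import numpy as np

    def predict_first_mask(features, weights, bias=0.0):
        # Toy stand-in for the neural network of Step S102: map N first acoustic
        # features of shape (N, F) to a first mask of N components. The sigmoid
        # keeps every component inside the value range [0, 1].
        logits = features @ weights + bias  # weights: shape (F,)
        return 1.0 / (1.0 + np.exp(-logits))  # first mask, shape (N,)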

The description of Step S103 is as follows:

VAD is also known as speech endpoint detection, speech edge detection, etc. It refers to detecting the presence of voice in a noisy environment. It is usually used in voice processing systems such as voice coding and voice enhancement to reduce the voice coding rate, save communication bandwidth, reduce the energy consumption of mobile devices, and improve the recognition rate.
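VAD values can be obtained in many ways; as one common illustration (an energy-threshold detector, not the forced-alignment approach described below), each frame whose energy exceeds a fraction of the maximum frame energy is assigned the value 1 and every other frame the value 0. The threshold heuristic is an assumption for the example.

    import numpy as np

    def energy_vad(features, ratio=0.1):
        # Illustrative energy-threshold VAD: 1 for frames whose energy exceeds
        # a fraction of the maximum frame energy, 0 otherwise.
        energy = np.sum(features ** 2, axis=1)     # per-frame energy
        threshold = ratio * energy.max()           # hypothetical heuristic
        return (energy > threshold).astype(float)  # VAD values, shape (N,)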

The VAD involved in the present application may be preset or may be determined according to a call type of a voice frame, where the type may be silence or phone.

The method of determining the VAD according to the call type of the voice frame is as follows:

Alternatively, the VAD information includes a VAD value corresponding to each of the N voice frames; when the N voice frames include a silence frame, the VAD value corresponding to the silence frame is set to 0, and, conversely, when the N voice frames include a phone frame, the VAD value corresponding to the phone frame is set to be greater than 0 but less than or equal to 1. The so-called “silence frame” refers to a voice frame with a silence type. The so-called “phone frame” refers to a voice frame with a phone type.

Alternatively, the modifying the first mask according to the VAD information of the N voice frames includes: calculating a product of the VAD information and the first mask to obtain a second mask, or calculating a product of the VAD information, the first mask, and a preset coefficient to obtain a second mask. This application does not limit how to obtain the second mask. The second mask is also a vector including N components, the N components respectively correspond to the N first acoustic features, and each of the N components has a value range of [0, 1]. The preset coefficient may be greater than 0 but less than or equal to 1.

Accordingly, when a VAD value is 0, the corresponding component of the second mask is also 0. In the present application, this modification is referred to as a hard modification approach.
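A minimal sketch of this hard modification approach, assuming the VAD values, the first mask and the preset coefficient described above:

    import numpy as np

    def hard_modification(first_mask, vad, coeff=1.0):
        # Step S103, hard modification: second mask = coeff * VAD * first mask.
        # A silence frame (VAD value 0) yields a second-mask component of 0;
        # coeff is the optional preset coefficient in (0, 1].
        return coeff * vad * first_mask  # all arrays of shape (N,)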

Optionally, the VAD information includes a VAD value corresponding to each of the N voice frames. Correspondingly, FIG. 2 is a flowchart of a voice processing method according to another embodiment of the present application. As shown in FIG. 2, before Step S103, the voice processing method further includes:

Step S1031: determining M1 voice frames having a VAD value of 1 and P1 voice frames having a VAD value of 0 from the N voice frames, where the M1 voice frames are adjacent to the P1 voice frames; and

Step S1032: smoothing the VAD value corresponding to M2 voice frames of the M1 voice frames and the VAD value corresponding to P2 voice frames of the P1 voice frames, such that the VAD value corresponding to the M2 voice frames and the VAD value corresponding to the P2 voice frames are changed gradually from 0 to 1 or from 1 to 0, where the M2 voice frames are adjacent to the P2 voice frames.

Step S1031 is described as follows: both M1 and P1 are positive integers greater than 1, and M1+P1=N. Specifically, first, determining a call type corresponding to each of the N voice frames, where the type includes silence and phone; determining a voice frame having the silence type as a voice frame having the VAD value of 0; and determining a voice frame having the phone type as a voice frame having the VAD value of 1.

In this application, a “forced alignment” approach may be used to determine the call type corresponding to each of the N voice frames. The so-called “forced alignment” refers to determining the start and end time of each type, for example, which voice frame or voice frames correspond to a certain type. For instance, the first M1 voice frames of the N voice frames are forcibly aligned to the silence type, and the P1 voice frames following the M1 voice frames are forcibly aligned to the phone type. It should be noted that this is merely an example. In fact, the N voice frames may be sequentially composed of N1 voice frames having the silence type, N2 voice frames having the phone type, N3 voice frames having the silence type, . . . , Nn voice frames having the phone type, where N1+N2+ . . . +Nn=N, and N1, N2, . . . , Nn are all integers greater than or equal to 0, which is not limited in the present application.

Step S1032 is described as follows: 1≤M2≤M1, 1≤P2≤P1; optionally, M2 and P2 are determined by a Hamming window, a triangular window or a Hanning window. Preferably, M2+P2=10. FIG. 3 is a schematic diagram of smoothing a VAD value according to an embodiment of the present application. As shown in FIG. 3, the 0th voice frame to the 30th voice frame are silence frames, that is, their respective VAD values are 0; the 31st voice frame to the 280th voice frame are phone frames, that is, their respective VAD values are 1; and the 281st voice frame to the 300th voice frame are silence frames again, that is, their respective VAD values are 0. A smoothing process is then performed on the voice frames from the 20th voice frame to the 40th voice frame, which may specifically include: determining the corresponding point coordinate (20, 0) of the 20th voice frame and the corresponding point coordinate (40, 1) of the 40th voice frame, and determining a straight line from the two points, where the straight line is the result of the smoothing process on the voice frames from the 20th voice frame to the 40th voice frame. Accordingly, for the voice frames from the 20th voice frame to the 40th voice frame, their VAD values are gradually changed from 0 to 1. Similarly, a smoothing process is performed on the voice frames from the 260th voice frame to the 290th voice frame, which may specifically include: determining the corresponding point coordinate (260, 1) of the 260th voice frame and the corresponding point coordinate (290, 0) of the 290th voice frame, and determining a straight line from the two points, where the straight line is the result of the smoothing process on the voice frames from the 260th voice frame to the 290th voice frame. Accordingly, for the voice frames from the 260th voice frame to the 290th voice frame, their VAD values are gradually changed from 1 to 0.
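A sketch of this soft modification under the FIG. 3 scheme is given below: each hard 0/1 jump in the VAD track is replaced by a straight-line ramp over a fixed number of frames. The ramp length is illustrative (a Hamming, triangular or Hanning window could determine M2 and P2 instead), and the transitions are assumed to be well separated.

    import numpy as np

    def smooth_vad(vad, ramp_len=20):
        # Soft modification (sketch): replace each 0/1 jump with a linear ramp,
        # as in FIG. 3 where frames 20-40 rise from 0 to 1 and frames 260-290
        # fall from 1 to 0. ramp_len plays the role of M2 + P2.
        vad = np.asarray(vad, dtype=float).copy()
        half = ramp_len // 2
        jumps = np.flatnonzero(np.diff(vad) != 0)  # indices where the value flips
        for j in jumps:
            lo = max(j - half + 1, 0)
            hi = min(j + half + 1, len(vad))
            vad[lo:hi] = np.linspace(vad[lo], vad[hi - 1], hi - lo)  # gradual change
        return vad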

In the present application, this alternative is referred to as a softmodification approach.

The description of Step S104 is as follows:

Alternatively, the N first acoustic features are processed according to the second mask to obtain N second acoustic features. Assume that any of the second acoustic features is denoted as estimate, the first acoustic feature corresponding to the second acoustic feature is denoted as noise, and the component of the second mask corresponding to the first acoustic feature is denoted as h; then estimate=noise*h, where * represents multiplication.

Alternatively, the N first acoustic features are processed according to the second mask to obtain one second acoustic feature. Assume that the second acoustic feature is denoted as estimate and the N first acoustic features are denoted as noise(N), where noise(N) is a vector consisting of the N first acoustic features, and the components of the second mask corresponding to the first acoustic features are denoted as h(N); then estimate=noise(N)*(h(N))^T, where * represents the product of the vectors and (h(N))^T represents the transpose of h(N).
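The two variants of Step S104 can be sketched as follows, treating each first acoustic feature as one element of an N-element vector, as the text above does; the function names are illustrative.

    import numpy as np

    def apply_mask_per_frame(noise, h):
        # Step S104, first variant: estimate = noise * h elementwise,
        # yielding N second acoustic features.
        return noise * h  # noise and h both of shape (N,)

    def apply_mask_combined(noise, h):
        # Step S104, second variant: estimate = noise(N) * (h(N))^T, the inner
        # product of the two vectors, yielding one second acoustic feature.
        return float(noise @ h)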

In view of the above, the present application provides a voice processing method. The key technology of the method is to modify a mask according to VAD information, thereby eliminating a large number of discrete masks, resulting in more effective noise suppression and less damage to the voice.

FIG. 4 is a schematic diagram of a voice processing apparatus 400 according to an embodiment of the present application. As shown in FIG. 4, the voice processing apparatus may be a part of or the entire computer, tablet, or mobile phone. For example, the apparatus may be a computer or a processor or the like, where the apparatus includes:

an acquiring module 401, configured to acquire a first acoustic feature of each of N voice frames, where N is a positive integer greater than 1;

a training module 402, configured to apply a neural network algorithm to N first acoustic features acquired by the acquiring module 401 to obtain a first mask;

a modification module 403, configured to modify, according to VAD information of the N voice frames, the first mask obtained by the training module 402 to obtain a second mask; and

a first processing module 404, configured to process the N first acoustic features according to the second mask obtained by the modification module 403 to obtain a second acoustic feature.

Optionally, the modification module 403 is configured to:

calculate a product of the VAD information and the first mask to obtain the second mask.

Optionally, the VAD information includes a VAD value corresponding to each of the voice frames. Correspondingly, the apparatus further includes:

a setting module 405, configured to, when the N voice frames include a silence frame, set a VAD value corresponding to the silence frame to 0.

Optionally, the VAD information includes a VAD value corresponding to each of the voice frames.

Correspondingly, the apparatus further includes:

a determining module 406, configured to determine M1 voice frames having a VAD value of 1 and P1 voice frames having a VAD value of 0 from the N voice frames, where the M1 voice frames are adjacent to the P1 voice frames, and where M1 and P1 are positive integers greater than 1; and

a second processing module 407, configured to smooth the VAD value corresponding to M2 voice frames of the M1 voice frames and the VAD value corresponding to P2 voice frames of the P1 voice frames, such that the VAD value corresponding to the M2 voice frames and the VAD value corresponding to the P2 voice frames are changed gradually from 0 to 1 or from 1 to 0, where the M2 voice frames are adjacent to the P2 voice frames, and where 1≤M2≤M1, 1≤P2≤P1.

Optionally, the determining module 406 is specifically configured to: determine a call type corresponding to each of the N voice frames, where the type includes silence and phone; determine a voice frame having the silence type as a voice frame having the VAD value of 0; and determine a voice frame having the phone type as a voice frame having the VAD value of 1.

Optionally, M2 and P2 are determined by a Hamming window, a triangular window or a Hanning window.

The present application provides a voice processing apparatus, which can be used in the voice processing method described above. For contents and effects of the apparatus, reference may be made to the description of the method embodiment, which is not repeated herein.

FIG. 5 is a schematic diagram of a voice processing device 500 according to an embodiment of the present application. The voice processing device may be a smart device such as a computer, a tablet, or a mobile phone. As shown in FIG. 5, the device includes:

a memory 501 and a processor 502, where the memory 501 is configured to store instructions executed by the processor 502, such that the processor 502 executes the voice processing method described above.

Optionally, the device further includes: a transceiver 503, configured to achieve communications between the device 500 and other devices.

The memory 501 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as a static random access memory (Static Random Access Memory, SRAM), an electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), an erasable programmable read-only memory (Erasable Programmable Read-Only Memory, EPROM), a programmable read-only memory (Programmable Read-Only Memory, PROM), a read-only memory (Read-Only Memory, ROM), a magnetic memory, a flash memory, a magnetic disk or a compact disk.

The processor 502 may be implemented by one or more application specific integrated circuits (Application Specific Integrated Circuit, ASIC), digital signal processors (Digital Signal Processor, DSP), digital signal processing devices (Digital Signal Processing Device, DSPD), programmable logic devices (Programmable Logic Device, PLD), field-programmable gate arrays (Field-Programmable Gate Array, FPGA), controllers, microcontrollers, microprocessors or other electronic elements.

The processor 502 is configured to perform the following method: acquiring a first acoustic feature of each of N voice frames, where N is a positive integer greater than 1; applying a neural network algorithm to N first acoustic features to obtain a first mask; modifying the first mask according to VAD information of the N voice frames to obtain a second mask; and processing the N first acoustic features according to the second mask to obtain a second acoustic feature.

Optionally, the processor 502 is further configured to calculate a product of the VAD information and the first mask to obtain the second mask.

Optionally, the VAD information includes a VAD value corresponding to each of the voice frames. Correspondingly, the processor 502 is further configured to, when the N voice frames include a silence frame, set a VAD value corresponding to the silence frame to zero.

Optionally, the VAD information includes a VAD value corresponding to each of the voice frames. The processor 502 is further configured to: determine M1 voice frames having a VAD value of 1 and P1 voice frames having a VAD value of 0 from the N voice frames, where the M1 voice frames are adjacent to the P1 voice frames, and where M1 and P1 are positive integers greater than 1; and smooth the VAD value corresponding to M2 voice frames of the M1 voice frames and the VAD value corresponding to P2 voice frames of the P1 voice frames, such that the VAD value corresponding to the M2 voice frames and the VAD value corresponding to the P2 voice frames are changed gradually from 0 to 1 or from 1 to 0, where the M2 voice frames are adjacent to the P2 voice frames, and where 1≤M2≤M1, 1≤P2≤P1.

Optionally, the processor 502 is specifically configured to: determine a call type corresponding to each of the N voice frames, where the type includes silence and phone; determine a voice frame having the silence type as a voice frame having the VAD value of 0; and determine a voice frame having the phone type as a voice frame having the VAD value of 1.

Optionally, M2 and P2 are determined by a Hamming window, a triangular window or a Hanning window.

The present application provides a voice processing device, which may be used in the voice processing method described above. For contents and effects of the voice processing device, reference may be made to the description of the method embodiment, which is not repeated herein.

The application also provides a storage medium, including: computer executable instructions for implementing the voice processing method described above. For contents and effects of the storage medium, reference may be made to the description of the method embodiment, which is not repeated herein.

The application also provides a computer program product, including: computer executable instructions for implementing the voice processing method described above. For contents and effects of the computer program product, reference may be made to the description of the method embodiment, which is not repeated herein.

It will be understood by those skilled in the art that all or part of the steps implementing the above method embodiments may be performed by hardware associated with program instructions. The aforementioned program may be stored in a computer readable medium. When the program is executed, the steps of the foregoing method embodiments are performed; and the foregoing medium includes a medium that can store program codes, such as a ROM, a RAM, a magnetic disk, or a compact disk.

It should be noted that the above embodiments are merely used to illustrate, but are not intended to limit, the technical solutions of the present application. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may be modified, or some or all of the technical features may be substituted with other equivalents; however, these modifications or substitutions do not make the essence of their corresponding technical solutions deviate from the scope of the technical solutions of the embodiments of the present application.

What is claimed is:
1. A voice processing method, comprising: acquiring a first acoustic feature of each of N voice frames, wherein N is a positive integer greater than 1; applying a neural network algorithm to N first acoustic features to obtain a first mask; modifying the first mask according to voice activity detection (VAD) information of the N voice frames to obtain a second mask; and processing the N first acoustic features according to the second mask to obtain a second acoustic feature.
2. The method according to claim 1, wherein the modifying the first mask according to the VAD information of the N voice frames comprises: calculating a product of the VAD information and the first mask to obtain the second mask.
3. The method according to claim 1, wherein the VAD information comprises a VAD value corresponding to each of the voice frames; and when the N voice frames comprise a silence frame, setting a VAD value corresponding to the silence frame to zero.
4. The method according to claim 1, wherein the VAD information comprises a VAD value corresponding to each of the voice frames; and correspondingly, before the modifying the first mask according to VAD information of the N voice frames, the method further comprises: determining M1 voice frames having a VAD value of 1 and P1 voice frames having a VAD value of 0 from the N voice frames, wherein the M1 voice frames are adjacent to the P1 voice frames, and wherein M1 and P1 are positive integers greater than 1; and smoothing the VAD value corresponding to M2 voice frames of the M1 voice frames and the VAD value corresponding to P2 voice frames of the P1 voice frames, such that the VAD value corresponding to the M2 voice frames and the VAD value corresponding to the P2 voice frames are changed gradually from 0 to 1 or from 1 to 0, wherein the M2 voice frames are adjacent to the P2 voice frames, and wherein 1≤M2≤M1, 1≤P2≤P1.
5. The method according to claim 4, wherein the determining the M1 voice frames having the VAD value of 1 and the P1 voice frames having the VAD value of 0 from the N voice frames comprises: determining a call type corresponding to each of the N voice frames, wherein the type comprises silence and phone; determining a voice frame having the silence type as a voice frame having the VAD value of 0; and determining a voice frame having the phone type as a voice frame having the VAD value of 1.
6. The method according to claim 5, wherein M2 and P2 are determined by a Hamming window, a triangular window or a Hanning window.
7. A voice processing device, comprising: a memory and a processor; wherein the memory is configured to store instructions executed by the processor, such that the processor performs: acquiring a first acoustic feature of each of N voice frames, wherein N is a positive integer greater than 1; applying a neural network algorithm to N first acoustic features to obtain a first mask; modifying the first mask according to voice activity detection (VAD) information of the N voice frames to obtain a second mask; and processing the N first acoustic features according to the second mask to obtain a second acoustic feature.
8. The device according to claim 7, wherein the processor is configured to: calculate a product of the VAD information and the first mask to obtain the second mask.
9. The device according to claim 7, wherein the VAD information comprises a VAD value corresponding to each of the voice frames; and correspondingly, the processor is further configured to: when the N voice frames comprise a silence frame, set a VAD value corresponding to the silence frame to zero.
10. The device according to claim 7, wherein the VAD information comprises a VAD value corresponding to each of the voice frames; and correspondingly, the processor is further configured to: determine M1 voice frames having a VAD value of 1 and P1 voice frames having a VAD value of 0 from the N voice frames, wherein the M1 voice frames are adjacent to the P1 voice frames, and wherein M1 and P1 are positive integers greater than 1; and smooth the VAD value corresponding to M2 voice frames of the M1 voice frames and the VAD value corresponding to P2 voice frames of the P1 voice frames, such that the VAD value corresponding to the M2 voice frames and the VAD value corresponding to the P2 voice frames are changed gradually from 0 to 1 or from 1 to 0, wherein the M2 voice frames are adjacent to the P2 voice frames, and wherein 1≤M2≤M1, 1≤P2≤P1.
11. The device according to claim 10, wherein the processor is configured to: determine a call type corresponding to each of the N voice frames, wherein the type comprises silence and phone; determine a voice frame having the silence type as a voice frame having the VAD value of 0; and determine a voice frame having the phone type as a voice frame having the VAD value of 1.
12. The device according to claim 11, wherein M2 and P2 are determined by a Hamming window, a triangular window or a Hanning window.
13. A non-volatile storage medium, comprising: computer executable instructions for implementing the following steps: acquiring a first acoustic feature of each of N voice frames, wherein N is a positive integer greater than 1; applying a neural network algorithm to N first acoustic features to obtain a first mask; modifying the first mask according to voice activity detection (VAD) information of the N voice frames to obtain a second mask; and processing the N first acoustic features according to the second mask to obtain a second acoustic feature.
14. The non-volatile storage medium according to claim 13, wherein the computer executable instructions are configured to implement: calculating a product of the VAD information and the first mask to obtain the second mask.
15. The non-volatile storage medium according to claim 13, wherein the VAD information comprises a VAD value corresponding to each of the voice frames; and correspondingly, the computer executable instructions are further configured to implement: when the N voice frames comprise a silence frame, setting a VAD value corresponding to the silence frame to zero.
16. The non-volatile storage medium according to claim 13, wherein the VAD information comprises a VAD value corresponding to each of the voice frames; and correspondingly, the computer executable instructions are further configured to implement: determining M1 voice frames having a VAD value of 1 and P1 voice frames having a VAD value of 0 from the N voice frames, wherein the M1 voice frames are adjacent to the P1 voice frames, and wherein M1 and P1 are positive integers greater than 1; and smoothing the VAD value corresponding to M2 voice frames of the M1 voice frames and the VAD value corresponding to P2 voice frames of the P1 voice frames, such that the VAD value corresponding to the M2 voice frames and the VAD value corresponding to the P2 voice frames are changed gradually from 0 to 1 or from 1 to 0, wherein the M2 voice frames are adjacent to the P2 voice frames, and wherein 1≤M2≤M1, 1≤P2≤P1.
17. The non-volatile storage medium according to claim 16, wherein the computer executable instructions are configured to implement: determining a call type corresponding to each of the N voice frames, wherein the type comprises silence and phone; determining a voice frame having the silence type as a voice frame having the VAD value of 0; and determining a voice frame having the phone type as a voice frame having the VAD value of 1.
18. The non-volatile storage medium according to claim 17, wherein M2 and P2 are determined by a Hamming window, a triangular window or a Hanning window.